1 Introduction

To characterize the physiological variations of the human urine metabolome with age, body mass index (BMI), and gender, samples from a cohort of adult volunteers from the Saclay research institute (hence the “sac[lay]urine” name) have been analyzed by LC–HRMS (E. A. Thévenot et al. 2015).

In this study conducted by the MetaboHUB French Infrastructure for Metabolomics, urine samples from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh performance liquid chromatography (UPLC) coupled to high-resolution mass spectrometry (LTQ-Orbitrap). Raw data are publicly available on the MetaboLights repository (MTBLS404).

This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):

  • correction of signal drift (loess model built on QC pools) and batch effects (two batches)

  • variable filtering (QC coefficent of variation < 30%)

  • normalization by the sample osmolality

  • log10 transformation, sample filtering (Hotelling, decile and missing pvalues > 0.001) resulting in the HU_096 sample being discarded

  • univariate hypothesis testing of significant variations with age, BMI, or between genders (FDR < 0.05)

  • PCA exploration of the data set; ropls Bioconductor package (E. A. Thévenot et al. 2015)

  • OPLS(-DA) modeling of age, BMI and gender; ropls Bioconductor package (E. A. Thévenot et al. 2015)

  • selection of the features which significantly contributes to the discrimination between gender with PLS-DA, Random Forest, or Support Vector Machines classifiers; biosigner Bioconductor package (Rinaudo et al. 2016)

Methods from the metabolis, ropls, and biosigner packages used for the analysis of the sacurine dataset

Methods from the metabolis, ropls, and biosigner packages used for the analysis of the sacurine dataset

A Galaxy version of this analysis is available W4M00001 ‘Sacurine-statistics’ on the Workflow4metabolomics.org online infrastructure (Guitton et al. 2017)

2 Hands-on

Loading the metabolis package

suppressMessages(library(metabolis))

2.1 metRead: Reading the data

The metRead function reads the data sets and builds the ExpressionSet object. For additional information about ExpressionSet class, see the “An introduction to Biobase and ExpressionSets” documentation from the Biobase package.

3 table format used as input

3 table format used as input

sacSet <- metabolis::metRead(system.file("extdata/sacurine",
                                         package = "metabolis"))
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 113 features, 210 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: QC1_001 HU_neg_017 ... QC1_012_b2 (210 total)
##   varLabels: subset full ... gender (10 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: (2-methoxyethoxy)propanoic acid isomer
##     (gamma)Glu-Leu/Ile ... Xanthosine (113 total)
##   fvarLabels: database_identifier chemical_formula ... reliability
##     (10 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

2.2 metView: Looking at the data

sacSet <- metabolis::metView(sacSet)
## 
## 
## Data description:
## 
## observations: 210 
## variables: 113 
## missing: 0 
## 0 values (%): 0 
## min: 506.5 
## mean: 6700000 
## median: 670000 
## max: 6.8e+08 
## 
## Sample types:
## 
##   pool sample 
##     26    184

2.3 Post-processing

2.3.1 metCorrect: Correcting signal drift and batch effect

sacSet <- metabolis::metCorrect(sacSet)
## 
## Reference observations are:  pool 
## 
## Processing batch:
##  ne1
##  ne2

2.3.2 Variable filtering

features with QC coefficent of variation > 30%

  • using metView to compute the ‘pool_CV’ metric
sacSet <- metabolis::metView(sacSet)
## 
## 
## Data description:
## 
## observations: 210 
## variables: 113 
## missing: 0 
## 0 values (%): 0 
## min: 268.1565 
## mean: 6500000 
## median: 710000 
## max: 5.1e+08 
## 
## Sample types:
## 
##   pool sample 
##     26    184

  • discarding features with ‘pool_CV’ > 0.3
sacSet <- sacSet[Biobase::fData(sacSet)[, "pool_CV"] <= 0.3, ]
  • discarding the ‘pool’ observations
sacSet <- sacSet[, Biobase::pData(sacSet)[, "sampleType"] != "pool"]
print(sacSet)
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 110 features, 184 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: HU_neg_017 HU_neg_018 ... HU_neg_146_b2 (184 total)
##   varLabels: subset full ... deci_pval (15 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: (2-methoxyethoxy)propanoic acid isomer
##     (gamma)Glu-Leu/Ile ... Xanthosine (110 total)
##   fvarLabels: database_identifier chemical_formula ... pool_CV (11
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

2.3.3 Normalizing

osmolality

Biobase::exprs(sacSet) <- sweep(Biobase::exprs(sacSet),
                                2,
                                Biobase::pData(sacSet)[, "osmolality"],
                                "/")

2.3.4 metTransform: Transforming the data intensities

sacSet <- metabolis::metTransform(sacSet, "log10")
## 
## Missing values in the 'dataMatrix': 0 (0%)
## 
## Zero values in the 'dataMatrix': 0 (0%)
## 
## 'log10' transformation

2.3.5 Sample filtering

Hotelling, decile and missing p-values > 0.001

sacSet <- metabolis::metView(sacSet)
## 
## 
## Data description:
## 
## observations: 184 
## variables: 110 
## missing: 0 
## 0 values (%): 0 
## min: 0.1213445 
## mean: 3 
## median: 3 
## max: 5.9 
## 
## Sample types:
## 
## sample 
##    184

sacSet <- sacSet[, Biobase::pData(sacSet)[, "hotel_pval"] >= 0.001 &
                   Biobase::pData(sacSet)[, "miss_pval"] >= 0.001 &
                   Biobase::pData(sacSet)[, "deci_pval"] >= 0.001]

Final visual check of the data before performing the statistics

metabolis::metView(sacSet)
## 
## 
## Data description:
## 
## observations: 182 
## variables: 110 
## missing: 0 
## 0 values (%): 0 
## min: 0.1391968 
## mean: 3.1 
## median: 3 
## max: 5.9 
## 
## Sample types:
## 
## sample 
##    182

2.4 metTest: Univariate hypothesis testing

sacSet <- metabolis::metTest(sacSet,
                             factor = "gender",
                             testC = "limma",
                             adjustC = "BH")
## 
## Performing 'limma'

## 
## 41 variables (37%) were found significant at the 0.05 level (after the 'BH' correction).
## The first 15 are displayed below (sorted by increasing corrected p-values):
##                                   gender_limma_Female.Male_dif
## Testosterone glucuronide                             0.5815702
## p-Anisic acid                                       -1.0003177
## Malic acid                                          -0.2185078
## Pantothenic acid                                    -0.2158701
## Acetylphenylalanine                                 -0.2602737
## p-Hydroxyhippuric acid                              -0.1561200
## Citric acid                                         -0.1078489
## alpha-N-Phenylacetyl-glutamine                      -0.1120281
## Oxoglutaric acid                                    -0.2730331
## 4-Acetamidobutanoic acid isomer 3                   -0.1762354
## N-Acetyl-aspartic acid                              -0.1038041
## Gluconic acid and/or isomers                        -0.1319615
## Monoethyl phthalate                                 -0.3928070
## Glyceric acid                                       -0.1312902
## (gamma)Glu-Leu/Ile                                   0.1276751
##                                   gender_limma_Female.Male_BH
## Testosterone glucuronide                         2.713402e-12
## p-Anisic acid                                    5.281909e-11
## Malic acid                                       7.686227e-09
## Pantothenic acid                                 1.021174e-07
## Acetylphenylalanine                              5.422207e-06
## p-Hydroxyhippuric acid                           1.694287e-05
## Citric acid                                      4.098478e-05
## alpha-N-Phenylacetyl-glutamine                   8.992398e-05
## Oxoglutaric acid                                 1.313037e-04
## 4-Acetamidobutanoic acid isomer 3                1.952588e-04
## N-Acetyl-aspartic acid                           1.952588e-04
## Gluconic acid and/or isomers                     3.986868e-04
## Monoethyl phthalate                              1.094504e-03
## Glyceric acid                                    1.546311e-03
## (gamma)Glu-Leu/Ile                               1.758734e-03

2.5 Unsupervised analysis

2.5.1 PCA modeling

ropls Bioconductor package

(E. A. Thévenot et al. 2015)

suppressMessages(library(ropls))
sacPca <- ropls::opls(sacSet)
## PCA
## 182 samples x 110 variables
## standard scaling of predictors
##       R2X(cum) pre ort
## Total    0.526   9   0

plot(sacPca,
     parAsColFcVn = Biobase::pData(sacSet)[, "gender"],
                                   typeVc = "x-score")
## Warning: Character 'parAsColFcVn' set to a factor

plot(sacPca,
     parAsColFcVn = Biobase::pData(sacSet)[, "age"],
                                   typeVc = "x-score")

plot(sacPca,
     parAsColFcVn = Biobase::pData(sacSet)[, "bmi"],
                                   typeVc = "x-score")

sacSet <- ropls::getEset(sacPca)

2.5.2 metHeatmap: hierarchical clustering

sacSet <- metabolis::metHeatmap(sacSet, clustVi = c(5, 3))

2.6 Supervised modeling

2.6.1 PLS modeling

ropls Bioconductor package

(E. A. Thévenot et al. 2015)

sacPlsda <- ropls::opls(sacSet, "gender")
## PLS-DA
## 182 samples x 110 variables and 1 response
## standard scaling of predictors and response(s)
##       R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
## Total    0.278    0.731   0.562 0.261   3   0 0.05 0.05

sacSet <- ropls::getEset(sacPlsda)

2.6.2 Feature selection

biosigner Bioconductor package

(Rinaudo et al. 2016)

suppressMessages(library(biosigner))
set.seed(123)
sacSign <- biosigner::biosign(sacSet, "gender")
## Warning: 'y' character vector converted to a factor with levels 'Female',
## 'Male'
## Significant features from 'S' groups:
##                          plsda randomforest svm
## Testosterone glucuronide "S"   "S"          "S"
## p-Anisic acid            "S"   "A"          "S"
## Oxoglutaric acid         "C"   "S"          "S"
## Pantothenic acid         "A"   "B"          "S"
## Acetylphenylalanine      "B"   "B"          "S"
## Ketoleucine              "E"   "E"          "S"
## Malic acid               "S"   "E"          "E"
## Taurine                  "E"   "E"          "S"
## Accuracy:
##      plsda randomforest   svm
## Full 0.871        0.861 0.891
## AS   0.867        0.869 0.911
## S    0.884        0.860 0.906

set.seed(NULL)
sacSet <- biosigner::getEset(sacSign)

2.7 metWrite: Exporting the results

metabolis::metWrite(sacSet, dirC = getwd())

References

Guitton, Yann, Marie Tremblay-Franco, Gildas Le Corguillé G., Jean-Francois Martin, Melanie Pétéra, Pierrick Roger-Mele, Alexis Delabrière, et al. 2017. “Create, Run, Share, Publish, and Reference Your LC-MS, FIA-MS, GC-MS, and NMR Data Analysis Workflows with the Workflow4Metabolomics 3.0 Galaxy Online Infrastructure for Metabolomics.” The International Journal of Biochemistry & Cell Biology 93: 89–101. doi:10.1016/j.biocel.2017.07.002.

Rinaudo, Philippe, Samia Boudah, Christophe Junot, and Etienne A Thévenot. 2016. “Biosigner: A New Method for the Discovery of Significant Molecular Signatures from Omics Data.” Frontiers in Molecular Biosciences 3: –. doi:10.3389/fmolb.2016.00026.

Thévenot, Etienne A., Aurélie Roux, Ying Xu, Eric Ezan, and Christophe Junot. 2015. “Analysis of the Human Adult Urinary Metabolome Variations with Age, Body Mass Index and Gender by Implementing a Comprehensive Workflow for Univariate and OPLS Statistical Analyses.” Journal of Proteome Research 14 (8): 3322–35. doi:10.1021/acs.jproteome.5b00354.